TITLE by Prasad Kintali

In this project we explore white wine quality using a data set of 4898 samples with 13 variables.

Univariate Plots Section

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

The observed wine quality scores range from 3 to 9 and are roughly normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Fixed acidity is approximately normally distributed with a mean of 6.855. The first plot shows outliers above 10, while most of the data lies between 6.5 and 7.5, so the second plot was made to exclude the outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Volatile acidity ranges from 0.08 to 1.10 with a long right tail and a mean of 0.2782. To understand the long tail better, the plot was transformed, which discards a few outliers and spreads the bars out.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol follows a skewed distribution: most white wines have an alcohol level around 9.5, and the mean is 10.51.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Sulphates are well distributed with a median of 0.47 (mean 0.4898). There are a few outliers, which are not a major concern and were eliminated in the next plot.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density is approximately normally distributed with a median of 0.9937. The maximum value of 1.039 looks like an outlier, and the second plot eliminates it.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH is also normally distributed, with a median of 3.18, and there do not appear to be any outliers, even in the second plot.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The distribution of residual sugar does not look normal, and there is a spike at 2. There is also an outlier at 65.8.

The second plot looks cleaner: it shows a bimodal distribution and no outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

The chlorides distribution looks normal apart from an outlier. The second plot eliminates the outlier and shows a cleaner distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Citric acid is normally distributed, but there is an outlier. Most wines have a citric acid level around 0.3, and there is also a spike at 0.49.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Free sulfur dioxide is similar to chlorides: a normal distribution with outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Total sulfur dioxide is also normally distributed, with an outlier at 440 that is eliminated in the second plot.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.110   6.570   7.070   7.133   7.590  14.470

Here are the combined plots.

  1. First plot with outliers

  2. Second plot without outliers

The above two plots combine the counts for all features in a single plot, with and without modifications.

Univariate Analysis

What is the structure of your dataset?

The white wine data set contains 4898 samples with 12 features (not counting the row index X) that are directly or indirectly related to the quality of the wine. The features are fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality.

The quality of the wines ranges from 3 to 9.

All of the univariate plots above are either normally distributed or skewed, with a few outliers.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the data set is wine quality. Because quality depends on several other features in the data set, we need to explore them too.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

According to my internet research, the quality score is based on sensory data, while the remaining features are chemical properties of the wines, including density, acidity and alcohol content.

Did you create any new variables from existing variables in the dataset?

Yes. I created total acidity, which is the sum of volatile acidity and fixed acidity.
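As a minimal sketch of how such a derived column might be computed, the snippet below adds the two acidity columns. The data frame name `wine_head` is an assumption made for illustration, and it holds only the first three values of each column as printed by `str()` earlier, not the full data set.

```r
# Illustrative rows only: the first three values of each column as
# printed by str() earlier in this report. `wine_head` is a made-up name.
wine_head <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.30, 0.28)
)

# New variable: total acidity as the sum of fixed and volatile acidity.
wine_head$total.acidity <- wine_head$fixed.acidity +
  wine_head$volatile.acidity

print(wine_head)
```

The same assignment applied to the full data frame would create the `total.acidity` column that appears in the correlation tables.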

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Yes, I performed a few additional operations to eliminate the outliers. In addition, I identified an unusual distribution in residual sugar, so I applied a log10 transformation, after which the data shows a bimodal distribution. The rest of the features are either normally distributed or skewed.
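The log10 transformation could look roughly like the sketch below. It uses base R only, and the vector holds just the first ten residual sugar values printed by `str()` earlier, so it illustrates the mechanics rather than reproducing the bimodal shape. The bin count is an arbitrary choice.

```r
# First ten residual.sugar values as printed by str() earlier.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# log10 compresses the long right tail. All values are positive
# (the reported minimum is 0.6), so the transform is always defined.
log_sugar <- log10(residual.sugar)
print(summary(log_sugar))

# Histogram on the transformed scale (base graphics; 10 bins is arbitrary).
hist(log_sugar, breaks = 10,
     main = "Residual sugar (log10 scale)",
     xlab = "log10(residual sugar)")
```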

Bivariate Plots Section

ScatterPlot Matrix:

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.25581431      0.002857966
## fixed.acidity        -0.255814305    1.00000000     -0.022697290
## volatile.acidity      0.002857966   -0.02269729      1.000000000
## citric.acid          -0.149899918    0.28918070     -0.149471811
## residual.sugar        0.006623775    0.08902070      0.064286060
## chlorides            -0.045645192    0.02308564      0.070511571
## free.sulfur.dioxide  -0.011928911   -0.04939586     -0.097011939
## total.sulfur.dioxide -0.161979037    0.09106976      0.089260504
## density              -0.185976097    0.26533101      0.027113845
## pH                   -0.115774132   -0.42585829     -0.031915368
## sulphates             0.009807759   -0.01714299     -0.035728147
## alcohol               0.213656245   -0.12088112      0.067717943
## quality               0.035763247   -0.11366283     -0.194722969
## total.acidity        -0.254350594    0.99290766      0.096321153
##                       citric.acid residual.sugar   chlorides
## X                    -0.149899918    0.006623775 -0.04564519
## fixed.acidity         0.289180698    0.089020701  0.02308564
## volatile.acidity     -0.149471811    0.064286060  0.07051157
## citric.acid           1.000000000    0.094211624  0.11436445
## residual.sugar        0.094211624    1.000000000  0.08868454
## chlorides             0.114364448    0.088684536  1.00000000
## free.sulfur.dioxide   0.094077221    0.299098354  0.10139235
## total.sulfur.dioxide  0.121130798    0.401439311  0.19891030
## density               0.149502571    0.838966455  0.25721132
## pH                   -0.163748211   -0.194133454 -0.09043946
## sulphates             0.062330940   -0.026664366  0.01676288
## alcohol              -0.075728730   -0.450631222 -0.36018871
## quality              -0.009209091   -0.097576829 -0.20993441
## total.acidity         0.270135269    0.096274432  0.03136937
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                          -0.0119289106         -0.161979037 -0.18597610
## fixed.acidity              -0.0493958591          0.091069756  0.26533101
## volatile.acidity           -0.0970119393          0.089260504  0.02711385
## citric.acid                 0.0940772210          0.121130798  0.14950257
## residual.sugar              0.2990983537          0.401439311  0.83896645
## chlorides                   0.1013923521          0.198910300  0.25721132
## free.sulfur.dioxide         1.0000000000          0.615500965  0.29421041
## total.sulfur.dioxide        0.6155009650          1.000000000  0.52988132
## density                     0.2942104109          0.529881324  1.00000000
## pH                         -0.0006177961          0.002320972 -0.09359149
## sulphates                   0.0592172458          0.134562367  0.07449315
## alcohol                    -0.2501039415         -0.448892102 -0.78013762
## quality                     0.0081580671         -0.174737218 -0.30712331
## total.acidity              -0.0607153894          0.101284413  0.26738970
##                                 pH    sulphates     alcohol      quality
## X                    -0.1157741316  0.009807759  0.21365624  0.035763247
## fixed.acidity        -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity     -0.0319153683 -0.035728147  0.06771794 -0.194722969
## citric.acid          -0.1637482114  0.062330940 -0.07572873 -0.009209091
## residual.sugar       -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides            -0.0904394560  0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide  -0.0006177961  0.059217246 -0.25010394  0.008158067
## total.sulfur.dioxide  0.0023209718  0.134562367 -0.44889210 -0.174737218
## density              -0.0935914935  0.074493149 -0.78013762 -0.307123313
## pH                    1.0000000000  0.155951497  0.12143210  0.099427246
## sulphates             0.1559514973  1.000000000 -0.01743277  0.053677877
## alcohol               0.1214320987 -0.017432772  1.00000000  0.435574715
## quality               0.0994272457  0.053677877  0.43557472  1.000000000
## total.acidity        -0.4277827423 -0.021316418 -0.11229714 -0.136319694
##                      total.acidity
## X                      -0.25435059
## fixed.acidity           0.99290766
## volatile.acidity        0.09632115
## citric.acid             0.27013527
## residual.sugar          0.09627443
## chlorides               0.03136937
## free.sulfur.dioxide    -0.06071539
## total.sulfur.dioxide    0.10128441
## density                 0.26738970
## pH                     -0.42778274
## sulphates              -0.02131642
## alcohol                -0.11229714
## quality                -0.13631969
## total.acidity           1.00000000

There are a lot of unexpected correlation coefficients between a few features. So let’s eliminate the weakly correlated features X, volatile acidity, citric acid, sulphates and quality, and draw the scatterplot matrix again.

##                      fixed.acidity residual.sugar   chlorides
## fixed.acidity           1.00000000     0.08902070  0.02308564
## residual.sugar          0.08902070     1.00000000  0.08868454
## chlorides               0.02308564     0.08868454  1.00000000
## free.sulfur.dioxide    -0.04939586     0.29909835  0.10139235
## total.sulfur.dioxide    0.09106976     0.40143931  0.19891030
## density                 0.26533101     0.83896645  0.25721132
## pH                     -0.42585829    -0.19413345 -0.09043946
## alcohol                -0.12088112    -0.45063122 -0.36018871
## total.acidity           0.99290766     0.09627443  0.03136937
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## fixed.acidity              -0.0493958591          0.091069756  0.26533101
## residual.sugar              0.2990983537          0.401439311  0.83896645
## chlorides                   0.1013923521          0.198910300  0.25721132
## free.sulfur.dioxide         1.0000000000          0.615500965  0.29421041
## total.sulfur.dioxide        0.6155009650          1.000000000  0.52988132
## density                     0.2942104109          0.529881324  1.00000000
## pH                         -0.0006177961          0.002320972 -0.09359149
## alcohol                    -0.2501039415         -0.448892102 -0.78013762
## total.acidity              -0.0607153894          0.101284413  0.26738970
##                                 pH    alcohol total.acidity
## fixed.acidity        -0.4258582910 -0.1208811    0.99290766
## residual.sugar       -0.1941334540 -0.4506312    0.09627443
## chlorides            -0.0904394560 -0.3601887    0.03136937
## free.sulfur.dioxide  -0.0006177961 -0.2501039   -0.06071539
## total.sulfur.dioxide  0.0023209718 -0.4488921    0.10128441
## density              -0.0935914935 -0.7801376    0.26738970
## pH                    1.0000000000  0.1214321   -0.42778274
## alcohol               0.1214320987  1.0000000   -0.11229714
## total.acidity        -0.4277827423 -0.1122971    1.00000000
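A reduced correlation matrix like the one above can be obtained by dropping columns before calling `cor()`. The sketch below is an assumption about the approach, not the code actually used: `wine_head` and `drop_cols` are illustrative names, and the data are only the first ten rows printed by `str()`, so the coefficients will differ from those reported.

```r
# First ten rows of a few columns, as printed by str() earlier.
wine_head <- data.frame(
  X              = 1:10,
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  chlorides      = c(0.045, 0.049, 0.05, 0.058, 0.058, 0.05, 0.045,
                     0.045, 0.049, 0.044),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  quality        = rep(6L, 10)
)

# Columns to leave out of the matrix (a subset of those named in the text).
drop_cols <- c("X", "quality")
kept <- wine_head[, !(names(wine_head) %in% drop_cols)]

# Pearson correlation between every pair of remaining columns.
cm <- cor(kept)
print(round(cm, 3))
```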

My findings about the white wine data set from the scatterplot matrix above are as follows.

  1. As mentioned above, volatile acidity, citric acid, sulphates and quality show little correlation with the other features.
  2. The highest correlation is between density and residual sugar.
  3. Density appears to be correlated with residual sugar, total sulfur dioxide and alcohol.

Let’s explore the relationships between a few correlated and uncorrelated features using bivariate plots.

Density by Residual Sugar:

The above plot clearly shows a strong relationship between density and residual sugar, consistent with the correlation coefficient of 0.839.

Density by Alcohol:

We can also see a strong relationship between density and alcohol, with a correlation coefficient of -0.78.

Quality by Chlorides:

The above plot shows no clear correlation between quality and chlorides (correlation coefficient -0.21).

Quality by Chlorides using histogram:

## quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## -------------------------------------------------------- 
## quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## -------------------------------------------------------- 
## quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## -------------------------------------------------------- 
## quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## -------------------------------------------------------- 
## quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## -------------------------------------------------------- 
## quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350
## quality: 3
## [1] 1.086
## -------------------------------------------------------- 
## quality: 4
## [1] 8.166
## -------------------------------------------------------- 
## quality: 5
## [1] 75.103
## -------------------------------------------------------- 
## quality: 6
## [1] 99.388
## -------------------------------------------------------- 
## quality: 7
## [1] 33.608
## -------------------------------------------------------- 
## quality: 8
## [1] 6.705
## -------------------------------------------------------- 
## quality: 9
## [1] 0.137

The histogram suggests that quality is better for wines with a medium chloride concentration.
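Per-quality summaries in the format shown above resemble what base R's `by()` prints. The sketch below shows that pattern on a tiny made-up data frame (`toy` and its values are hypothetical, not wine data). It also prints group medians next to group sums, since sums grow with the number of wines in each quality level and so are harder to compare across levels.

```r
# Hypothetical toy rows, NOT taken from the wine data.
toy <- data.frame(
  quality   = c(5, 5, 6, 6, 7, 7),
  chlorides = c(0.050, 0.044, 0.045, 0.041, 0.036, 0.038)
)

# Six-number summary of chlorides within each quality level.
print(by(toy$chlorides, toy$quality, summary))

# Group medians describe a typical wine in each level;
# group sums also reflect how many wines the level contains.
med <- tapply(toy$chlorides, toy$quality, median)
tot <- tapply(toy$chlorides, toy$quality, sum)
print(med)
print(tot)
```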

Quality by Total Acidity

The above plot shows no clear correlation between quality and total acidity (correlation coefficient -0.136).

Quality by Total Acidity using Histogram

## quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.415   6.820   7.705   7.933   8.857  12.030 
## -------------------------------------------------------- 
## quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.450   6.745   7.310   7.511   7.920  10.910 
## -------------------------------------------------------- 
## quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.690   6.660   7.140   7.236   7.730  10.550 
## -------------------------------------------------------- 
## quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.110   6.550   7.030   7.098   7.567  14.470 
## -------------------------------------------------------- 
## quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.370   6.505   6.980   6.997   7.460   9.450 
## -------------------------------------------------------- 
## quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.125   6.475   7.040   6.935   7.490   8.570 
## -------------------------------------------------------- 
## quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.960   7.260   7.360   7.718   7.640   9.370
## quality: 3
## [1] 158.665
## -------------------------------------------------------- 
## quality: 4
## [1] 1224.24
## -------------------------------------------------------- 
## quality: 5
## [1] 10542.83
## -------------------------------------------------------- 
## quality: 6
## [1] 15601.92
## -------------------------------------------------------- 
## quality: 7
## [1] 6157.785
## -------------------------------------------------------- 
## quality: 8
## [1] 1213.545
## -------------------------------------------------------- 
## quality: 9
## [1] 38.59

The quality of wine is better for wines with medium total acidity.

Quality by Density

Again, the correlation between quality and density is very weak.

Quality by Density using Histogram:

## quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0001 
## -------------------------------------------------------- 
## quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0004 
## -------------------------------------------------------- 
## quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0024 
## -------------------------------------------------------- 
## quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0004 
## -------------------------------------------------------- 
## quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0006 
## -------------------------------------------------------- 
## quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9897  0.9898  0.9903  0.9915  0.9906  0.9970
## quality: 3
## [1] 19.89768
## -------------------------------------------------------- 
## quality: 4
## [1] 162.0671
## -------------------------------------------------------- 
## quality: 5
## [1] 1450.098
## -------------------------------------------------------- 
## quality: 6
## [1] 2184.727
## -------------------------------------------------------- 
## quality: 7
## [1] 873.3581
## -------------------------------------------------------- 
## quality: 8
## [1] 173.6413
## -------------------------------------------------------- 
## quality: 9
## [1] 4.9573

Density and quality have a loose negative correlation of -0.307, which is reflected in the boxplot. After removing the top 1% of outliers, the jitter chart shows a downward trend from left to right. The boxplot supports this: as quality increases from 5 to 9, the quartile ranges for density steadily decrease. We now know that higher quality ratings are associated with lower density values.

Quality by Alcohol:

Quality appears to depend noticeably on alcohol, with a correlation coefficient of 0.436.

Quality by Alcohol using Histogram:

## quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.35   11.00   12.60 
## -------------------------------------------------------- 
## quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90
## quality: 3
## [1] 206.9
## -------------------------------------------------------- 
## quality: 4
## [1] 1654.85
## -------------------------------------------------------- 
## quality: 5
## [1] 14291.48
## -------------------------------------------------------- 
## quality: 6
## [1] 23244.67
## -------------------------------------------------------- 
## quality: 7
## [1] 10003.78
## -------------------------------------------------------- 
## quality: 8
## [1] 2036.3
## -------------------------------------------------------- 
## quality: 9
## [1] 60.9

Most white wines have an alcohol level around 9. But the histogram shows that quality increases as the alcohol level increases, which supports the correlation coefficient of 0.436.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The main relationships in this bivariate analysis involve the alcohol feature. It has a strong relationship with density and with residual sugar.

However, no single remarkable relationship could be found with quality. None of the features analyzed is strongly related to quality on its own. This is to be expected, because producing a good-quality wine is not that simple.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The most interesting relationships involve the density feature. Looking at the correlations between features, density almost always has the highest values.

What was the strongest relationship you found?

The strongest relationship is between density and residual sugar, with a correlation of 0.84. Density and alcohol are also strongly correlated (-0.78).

Multivariate Plots Section

The above multi-plot diagram shows the relationship between density and alcohol for the individual quality levels.

Creating mtable for white wines dataset.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol^(1/3)), data = wine)
## m2: lm(formula = I(quality) ~ I(alcohol^(1/3)) + chlorides, data = wine)
## m3: lm(formula = I(quality) ~ I(alcohol^(1/3)) + chlorides + density, 
##     data = wine)
## m4: lm(formula = I(quality) ~ I(alcohol^(1/3)) + chlorides + density + 
##     pH, data = wine)
## m5: lm(formula = I(quality) ~ I(alcohol^(1/3)) + chlorides + density + 
##     pH + sulphates, data = wine)
## m6: lm(formula = I(quality) ~ I(alcohol^(1/3)) + chlorides + density + 
##     pH + total.acidity, data = wine)
## 
## ========================================================================================================
##                          m1            m2            m3            m4            m5            m6       
## --------------------------------------------------------------------------------------------------------
##   (Intercept)          -4.065***     -3.442***    -28.461***    -29.003***    -26.493***    -44.608***  
##                        (0.296)       (0.327)       (6.483)       (6.479)       (6.501)       (6.733)    
##   I(alcohol^(1/3))      4.545***      4.313***      4.980***      4.929***      4.875***      5.315***  
##                        (0.135)       (0.145)       (0.225)       (0.226)       (0.226)       (0.230)    
##   chlorides                          -2.482***     -2.382***     -2.295***     -2.349***     -2.376***  
##                                      (0.559)       (0.559)       (0.559)       (0.558)       (0.556)    
##   density                                          23.694***     23.554***     21.109***     40.242***  
##                                                    (6.132)       (6.126)       (6.149)       (6.442)    
##   pH                                                              0.248**       0.200**      -0.047     
##                                                                  (0.076)       (0.077)       (0.084)    
##   sulphates                                                                     0.398***                
##                                                                                (0.101)                  
##   total.acidity                                                                              -0.124***  
##                                                                                              (0.016)    
## --------------------------------------------------------------------------------------------------------
##   R-squared             0.187         0.191         0.193         0.195         0.198         0.205     
##   adj. R-squared        0.187         0.190         0.193         0.194         0.197         0.204     
##   sigma                 0.798         0.797         0.796         0.795         0.794         0.790     
##   F                  1129.793       576.903       390.673       296.255       240.790       252.560     
##   p                     0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -5846.130     -5836.294     -5828.835     -5823.494     -5815.779     -5792.251     
##   Deviance           3120.832      3108.324      3098.870      3092.119      3082.394      3052.922     
##   AIC               11698.259     11680.589     11667.669     11658.987     11645.558     11598.502     
##   BIC               11717.749     11706.575     11700.152     11697.967     11691.034     11643.978     
##   N                  4898          4898          4898          4898          4898          4898         
## ========================================================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

As we saw in the bivariate section, density is strongly correlated with residual sugar and alcohol, and this holds for every wine quality level.

Furthermore, a small relationship appears when combining total acidity with residual sugar and alcohol. The linear model has an R-squared value of about 0.2, which means that about 20% of the variance in quality is accounted for.
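To make the meaning of R-squared concrete, the sketch below fits two nested linear models in the style of m1 and m2 and reads off their R-squared values. The data are simulated with arbitrary coefficients (`toy` is a hypothetical stand-in, not the wine data), so the numbers will not match the table above.

```r
set.seed(1)  # reproducible toy data

# Simulated stand-in for the wine data frame (values are NOT real).
n <- 200
toy <- data.frame(alcohol   = runif(n, 8, 14),
                  chlorides = runif(n, 0.01, 0.10))
toy$quality <- 4.5 * toy$alcohol^(1/3) - 2.5 * toy$chlorides - 4 +
  rnorm(n, sd = 0.8)

# Nested models in the style of m1 and m2.
m1 <- lm(quality ~ I(alcohol^(1/3)), data = toy)
m2 <- update(m1, . ~ . + chlorides)

# R-squared is the share of the variance in quality each model explains.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(round(r2, 3))
```

Adding a predictor to a least-squares model never lowers R-squared, which is why the adjusted R-squared, AIC and BIC rows are the more useful ones when comparing models of different sizes.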

Were there any interesting or surprising interactions between features?

As said before, the most interesting feature is the density, analyzed with alcohol and residual sugar. No special interaction could be seen in this section.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

The distribution of residual sugar appears to be bimodal. This is not easy to explain; it may reflect demand for clearly differentiated levels of sweetness in wine. Official categories for wine sweetness do exist, but they are almost outliers in this data set.

Plot Two

Histogram of Density with color set by Quality

Description Two

This histogram provides a better visualization of the relationship between density and quality. Since there are some outliers for density, I removed the top and bottom 1% from the chart.
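Trimming the top and bottom 1% can be done with `quantile()`. The following is a sketch under stated assumptions rather than the code behind the plot: the vector `dens` is simulated, with one extreme value added to play the role of an outlier such as the reported maximum of 1.039.

```r
set.seed(42)  # reproducible toy data

# Simulated density values (NOT the real data), plus one extreme value
# standing in for an outlier such as the reported maximum of 1.039.
dens <- c(rnorm(500, mean = 0.994, sd = 0.003), 1.039)

# Keep only values between the 1st and 99th percentiles.
limits <- quantile(dens, probs = c(0.01, 0.99))
trimmed <- dens[dens >= limits[1] & dens <= limits[2]]

print(limits)
print(length(dens) - length(trimmed))  # number of values dropped
```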

Since the color is set by the quality, each quality value has a unique impact on the overall histogram, and I can draw insight from these distributions. The centers of the distributions shift to the left as the color changes to darker blue. This means that the main concentration of density values decreases as the quality increases.

Summarizing the data by quality supports this assertion: The median density steadily decreases from 0.9953 to 0.9903 as the quality value increases from “poor” (5) to “good” (9). This is similar to what I noticed when evaluating the correlation between alcohol and quality in the last plot. I realized that I needed to investigate the relationship between density and alcohol.

Plot Three

Description Three

This plot reflects the relationship between density, alcohol, and sugar in a single visualization. I split the residual sugar values into two buckets delineated by the median value of 5.2 in order to see the trends more clearly.
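Splitting at the median can be sketched with `cut()`. The bucket labels and the choice to put values equal to the median in the lower bucket are assumptions; the vector holds only the first ten residual sugar values printed by `str()` earlier.

```r
# First ten residual.sugar values as printed by str() earlier.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
sugar_median <- 5.2  # median reported in the summary output

# Two buckets: values up to the median are "low", values above are "high".
sugar_bucket <- cut(residual.sugar,
                    breaks = c(-Inf, sugar_median, Inf),
                    labels = c("low", "high"))
print(table(sugar_bucket))
```

The resulting factor can then be mapped to point color in a density by alcohol scatterplot.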

I can see that as the alcohol level increases, the density decreases because the scatterplot has a downward trend to the right. This suggests that alcohol is one of the less dense ingredients in wine. Also, the sugar red/blue coloring shows that as the sugar increases, the density also increases, since the blue dots are higher on the chart than the red dots. This suggests that sugar is one of the more dense ingredients in wine. Thirdly, there is a heavier concentration of blue dots on the left side of the chart than the right side, which means that lower alcohol levels are associated with higher levels of sugar. The correlation values between these variables support all of these insights from the chart.

I investigated the wine-making process in order to better understand the relationship between these features. Fermentation converts the sugars to alcohol, so the conclusions from this chart make logical sense. This was interesting to me, because the data helped me understand how wine is created.


Reflection

The white wines data set contains information on almost 5000 wines. First, an exploratory data analysis was performed to understand the features, along with some internet research to put the topic in context. This gave me some references about how quality could be calculated or predicted from the features already provided in the data set. Before that, some relationships caught my attention, such as the strong relationship of density with alcohol and residual sugar. Finally, trying to find relationships that determine good quality was quite frustrating. Some internet research led me to this formula: Sweet Taste (sugars + alcohols) <= => Acid Taste (acids). But reaching a conclusion wasn’t as easy as it seemed. I could find a small relationship between these features, but the resulting linear model accounts for only a small share of the variance in quality (about 20%).

One conclusion I can draw is that the data set lacks a wider spread of quality values. Almost all the wines are ‘NORMAL’, which makes them difficult to cluster. I also think that my analysis was somewhat biased by trying to predict quality with the previous formula.

In a next iteration or further analysis, the first thing to look into is the strange peak seen in the citric acid histogram. Another possible direction is to include other features in the final model to try to increase the share of variance accounted for.

Bibliography

Feature Knowledge

acidity http://www.calwineries.com/learn/wine-chemistry/acidity http://winemakersacademy.com/understanding-wine-acidity/

volatile acidity http://extension.psu.edu/food/enology/wine-production/volatile-acidity-in-wine

citric acid http://www.calwineries.com/learn/wine-chemistry/wine-acids/citric-acid

residual sugar http://www.calwineries.com/learn/wine-chemistry/sugar-in-wine

alcohol http://www.calwineries.com/learn/wine-chemistry/alcohol

Some thoughts